# Loading packages
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(lubridate)
## Loading required package: timechange
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(tidyr)
library(magrittr)
## 
## Attaching package: 'magrittr'
## The following object is masked from 'package:tidyr':
## 
##     extract
library(stringr)
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
library(corrplot)
## corrplot 0.92 loaded
library(reshape2)
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths
library(RColorBrewer)
library(readr)
library(corrr)
library(ggalluvial)
library(treemapify)
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
library(ggpubr)

Problem Statement

In this assignment, the focus is on implementing various types of visualizations and drawing conclusions from them. The main package for this assignment is ‘ggplot2’. For each task, we produce a different type of graph or plot, which helps us visualize the data in different forms.

TASK 1

Generate the density plot similar to what is shown in the figure below. (Dataset: airlines_delay.csv)

## Warning: Removed 231974 rows containing non-finite values (`stat_density()`).

OUTPUT 1

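The plot above shows density curves for five categories of delay. The delay values are passed through the log10 function before the densities are estimated. Values that are zero or missing become non-finite under log10 and are dropped, which is the most likely source of the warning above. After this task, we have a better understanding of working with density plots.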
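A minimal sketch of a log10 density plot is given here, assuming the five delay categories are the columns carrier_delay, weather_delay, nas_delay, security_delay and late_aircraft_delay. The data frame is invented toy data, not airlines_delay.csv, and the object names `delays`, `delays_long` and `density_plot` are hypothetical.

```r
library(ggplot2)
library(tidyr)

# Toy stand-in for airlines_delay.csv; all values are invented for illustration.
set.seed(1)
delays <- data.frame(
  carrier_delay       = c(0,  rexp(199, rate = 1 / 300)),
  weather_delay       = c(0,  rexp(199, rate = 1 / 50)),
  nas_delay           = c(NA, rexp(199, rate = 1 / 200)),
  security_delay      = c(0,  rexp(199, rate = 1 / 5)),
  late_aircraft_delay = c(0,  rexp(199, rate = 1 / 400))
)

# Reshape to long format: one row per (delay type, minutes) pair.
delays_long <- pivot_longer(delays, cols = everything(),
                            names_to = "delay_type", values_to = "minutes")

# log10(0) is -Inf and log10(NA) is NA, so those rows are non-finite;
# stat_density() drops them and emits a warning when the plot is drawn.
density_plot <- ggplot(delays_long, aes(x = log10(minutes), fill = delay_type)) +
  geom_density(alpha = 0.4) +
  labs(x = "log10(delay in minutes)", y = "Density", fill = "Delay type")
```

Printing `density_plot` draws the figure. In this toy data, exactly the five seeded zero or missing values are dropped.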

TASK 2

Generate correlation plots for arr_flights, arr_del15, arr_cancelled, arr_diverted, arr_delay, carrier_delay, weather_delay, nas_delay, security_delay and late_aircraft_delay. The below image is just for your reference, you are expected to create a plot with labels properly aligned and not overlapping. (Dataset : airlines_delay.csv)

## Warning in text.default(pos.ylabel[, 1] + 0.5, pos.ylabel[, 2],
## newcolnames[1:min(n, : "t1.cex" is not a graphical parameter
## Warning in title(title, ...): "t1.cex" is not a graphical parameter

## Warning in title(title, ...): "t1.cex" is not a graphical parameter

OUTPUT 2

In the visualization above, all cells fall in the blue range. The red end of the color legend represents negative correlation, but all of the correlations here are positive, so no red appears in the plot. For this reason, the color pattern differs from the reference image in the question.
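The warnings above mention "t1.cex", which is not an argument that corrplot recognizes; the label-size argument in corrplot is `tl.cex` (with a lowercase letter l). The sketch below shows one way to draw a correlation plot with rotated, non-overlapping labels. It uses invented toy data with only four of the ten columns, and the values are assumptions made for illustration.

```r
library(corrplot)

# Toy stand-in for four columns of airlines_delay.csv; values are invented.
set.seed(2)
n <- 100
arr_flights   <- rpois(n, lambda = 200)
arr_del15     <- rbinom(n, size = arr_flights, prob = 0.2)
arr_delay     <- arr_del15 * 50 + rnorm(n, sd = 100)
carrier_delay <- 0.4 * arr_delay + rnorm(n, sd = 50)
toy <- data.frame(arr_flights, arr_del15, arr_delay, carrier_delay)

# Pairwise Pearson correlations; use = "complete.obs" skips rows with NA.
corr_matrix <- cor(toy, use = "complete.obs")

# tl.cex sets label size, tl.srt rotates the top labels to avoid overlap.
corrplot(corr_matrix, method = "color", type = "upper",
         tl.cex = 0.8, tl.col = "black", tl.srt = 45)
```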

TASK 3

Based on your observations from the (Dataset: airlines_delay.csv), create any visualization of your choice.

OUTPUT 3

The plot above shows the total delays for each airline. One pattern stands out: for every airline, the most common types of delay are arrival delay and late-aircraft delay. Southwest Airlines has the highest total delays, followed by American Airlines. The airlines with the lowest total delays, according to the graph, are Comair Airlines and Hawaiian Airlines.

TASK 4

From ( Dataset: wages_jobs.csv) generate a heat map similar to the one shown below. The variable Difference is defined as the difference between number of male employees and the number of female employees. A negative value indicates a greater number of female than male employees.

## `summarise()` has grouped output by 'PUMS_Occupation'. You can override using
## the `.groups` argument.

OUTPUT 4

In this section, we compute the difference between the number of male and female employees in each category and visualize the result as a heat map. Several color palettes would work for this presentation. To match the reference, we used palette = ‘RdYlBu’, which clearly shows the difference between male and female employees.
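The computation of Difference can be sketched as follows, assuming columns named PUMS_Occupation, Year, Gender and Total_Population. Only PUMS_Occupation appears in the output above; the other names and all values are assumptions. Passing `.groups = "drop"` to `summarise()` also silences the grouping message shown earlier.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Toy stand-in for wages_jobs.csv; column names and values are assumptions.
wages <- data.frame(
  PUMS_Occupation  = rep(c("Teachers", "Engineers"), each = 4),
  Year             = rep(c(2017, 2017, 2018, 2018), times = 2),
  Gender           = rep(c("Male", "Female"), times = 4),
  Total_Population = c(100, 300, 120, 310, 500, 150, 520, 170)
)

# Sum employees per occupation, year and gender, then spread gender into
# columns so that Difference = Male - Female can be computed per row.
diff_by_year <- wages %>%
  group_by(PUMS_Occupation, Year, Gender) %>%
  summarise(Population = sum(Total_Population), .groups = "drop") %>%
  pivot_wider(names_from = Gender, values_from = Population) %>%
  mutate(Difference = Male - Female)

# A negative Difference means more female than male employees.
heat_map <- ggplot(diff_by_year,
                   aes(x = factor(Year), y = PUMS_Occupation, fill = Difference)) +
  geom_tile(color = "white") +
  scale_fill_distiller(palette = "RdYlBu") +
  labs(x = "Year", y = "Occupation")
```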

TASK 5

From ( Dataset: wages_jobs.csv ) generate an alluvial chart like the one below

## Warning: Computation failed in `stat_stratum()`
## Computation failed in `stat_stratum()`
## Caused by error in `nth()`:
## ! unused argument (na_rm = na_rm)

OUTPUT 5

In the plot above, the goal is to show how people in different occupations are distributed across the years by gender (male and female), visualized as an alluvial chart. The alluvial chart is drawn with the ggalluvial package. Based on the chart, male and female employees show the same pattern across occupations and years.
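For reference, a small alluvial chart with ggalluvial might be built as in the sketch below. The data frame, its column names and the choice of axes (Gender, Occupation, Year) are assumptions for illustration and may differ from the chart in the assignment.

```r
library(ggplot2)
library(ggalluvial)

# Toy stand-in for wages_jobs.csv; column names and values are assumptions.
jobs <- data.frame(
  Year             = rep(c("2017", "2018"), each = 4),
  Occupation       = rep(c("Teachers", "Teachers", "Engineers", "Engineers"), times = 2),
  Gender           = rep(c("Male", "Female"), times = 4),
  Total_Population = c(100, 300, 500, 150, 120, 310, 520, 170)
)

# Each axis is one categorical variable; flow width is proportional to y.
alluvial_chart <- ggplot(jobs,
                         aes(axis1 = Gender, axis2 = Occupation, axis3 = Year,
                             y = Total_Population)) +
  geom_alluvium(aes(fill = Gender)) +
  geom_stratum() +
  geom_text(stat = "stratum", aes(label = after_stat(stratum))) +
  scale_x_discrete(limits = c("Gender", "Occupation", "Year")) +
  labs(y = "Number of employees")
```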

TASK 6

From ( Dataset: wages_jobs.csv ) generate a stacked bar plot for the year 2018 with Occupation and Average Wage as the axis and Gender as the color

OUTPUT 6

In this task, the key idea was to visualize the average wage of male and female employees in 2018 by occupation. The data was filtered to the year 2018 and the categories were differentiated by gender. The result was then visualized as a stacked bar chart, with average wage on the x axis and occupation on the y axis. The bars are colored by gender.
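The filtering and stacking described above can be sketched as follows. The column names Year, Occupation, Gender and Average_Wage and all values are assumptions, since the actual layout of wages_jobs.csv is not shown here.

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for wages_jobs.csv; column names and values are assumptions.
wages <- data.frame(
  Year         = c(2017, 2018, 2018, 2018, 2018),
  Occupation   = c("Teachers", "Teachers", "Teachers", "Engineers", "Engineers"),
  Gender       = c("Male", "Male", "Female", "Male", "Female"),
  Average_Wage = c(48000, 50000, 47000, 90000, 85000)
)

# Keep only the 2018 rows.
wages_2018 <- filter(wages, Year == 2018)

# geom_col() stacks bars by default; fill = Gender colors each segment.
stacked_bar <- ggplot(wages_2018,
                      aes(x = Average_Wage, y = Occupation, fill = Gender)) +
  geom_col() +
  labs(x = "Average wage", y = "Occupation", title = "Average wage by occupation, 2018")
```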

TASK 7

From ( Dataset: occupations.csv ) generate the following tree map. The area of each rectangle is proportional to the number of people working in that Detailed Occupation.

## `summarise()` has grouped output by 'Detailed.Occupation'. You can override
## using the `.groups` argument.

OUTPUT 7

This code generates a treemap plot of the workforce distribution by occupation for the year 2018. The treemap displays the detailed occupation as a label, and the area of each treemap rectangle is proportional to the total population in each occupation. The fill color of each rectangle represents the major occupation group to which the detailed occupation belongs. The plot also has a title, “Workforce Distribution by Occupation for 2018”, and the text sizes for the title, legend, and labels are adjusted using the theme() function. Additionally, the font face for the labels, legend, and title is set to “bold”. The treemap is laid out using the “squarified” layout.
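A minimal treemapify sketch of the layout described above is given here. Detailed.Occupation appears in the output above; the columns Major.Occupation.Group and Total.Population, the occupations and the counts are assumptions made for illustration.

```r
library(ggplot2)
library(treemapify)

# Toy stand-in for occupations.csv; values and two column names are assumptions.
occupations <- data.frame(
  Detailed.Occupation    = c("Registered nurses", "Physicians",
                             "Software developers", "Data analysts"),
  Major.Occupation.Group = c("Healthcare", "Healthcare", "Computing", "Computing"),
  Total.Population       = c(3000, 1000, 1500, 500)
)

# area drives rectangle size, fill the group color, label the tile text.
tree_map <- ggplot(occupations,
                   aes(area = Total.Population, fill = Major.Occupation.Group,
                       label = Detailed.Occupation)) +
  geom_treemap(layout = "squarified") +
  geom_treemap_text(layout = "squarified", fontface = "bold",
                    colour = "white", place = "centre") +
  labs(title = "Workforce Distribution by Occupation for 2018",
       fill = "Major occupation group") +
  theme(plot.title   = element_text(face = "bold", size = 14),
        legend.title = element_text(face = "bold"))
```

The text layer is given the same layout as the tile layer so that each label lands on its own rectangle.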

TASK 8

Explore Plotly in R here and create any chart of your choice – from any of the datasets provided in this homework

## `summarise()` has grouped output by 'Year'. You can override using the
## `.groups` argument.

OUTPUT 8

This code explores Plotly by using the ggplotly() function to convert ggplot objects into interactive Plotly charts. The code creates two line plots, one visualizing the variation of average wage over the years, the other visualizing the variation of total population over the years, both separated by gender. Then it arranges the two plots vertically and adds annotations to the figure such as the title, source of data, and the package used for arranging the figure.
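The conversion step can be sketched as below, using invented yearly totals with hypothetical column names. The two panels are stacked here with plotly's `subplot()`, which is a swapped-in technique and may not match how the figure in this report was arranged and annotated.

```r
library(ggplot2)
library(plotly)

# Toy yearly summary by gender; column names and values are assumptions.
yearly <- data.frame(
  Year             = rep(2014:2018, times = 2),
  Gender           = rep(c("Male", "Female"), each = 5),
  Average_Wage     = c(50000, 51000, 52500, 53000, 54000,
                       42000, 43000, 44500, 45500, 46000),
  Total_Population = c(900, 920, 950, 970, 1000,
                       800, 830, 860, 900, 940)
)

# Two static ggplot2 line plots, one per measure, colored by gender.
wage_plot <- ggplot(yearly, aes(x = Year, y = Average_Wage, color = Gender)) +
  geom_line() + geom_point()
pop_plot <- ggplot(yearly, aes(x = Year, y = Total_Population, color = Gender)) +
  geom_line() + geom_point()

# ggplotly() converts each plot to an interactive widget; subplot() stacks them.
interactive_fig <- subplot(ggplotly(wage_plot), ggplotly(pop_plot),
                           nrows = 2, shareX = TRUE, titleY = TRUE)
```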

TASK 9

Pick any hex colors (minimum 3) of your choice and create a donut chart from any of the datasets provided in this homework.

OUTPUT 9

This code generates a donut plot using the ggplot2 library. It summarizes the total population of each major occupation group in the “df_occupations” data frame and calculates the fraction of the total population for each group. The plot is created by defining the y-axis maximum and minimum as the cumulative sum of the population fractions and labeling each segment with its corresponding percentage. The plot is further customized using various theme and scale functions such as setting the title, font size, color, and fill color for each segment. The final plot is stored in the “final_plot” object.
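The cumulative-sum construction described above can be sketched as follows. The group names, populations and hex colors are invented for illustration, and the object name `donut` is hypothetical.

```r
library(ggplot2)

# Toy totals per major occupation group; names and values are assumptions.
group_totals <- data.frame(
  group      = c("Healthcare", "Computing", "Education"),
  population = c(500, 300, 200)
)

# Each segment spans from the previous cumulative fraction (ymin) to its own (ymax).
group_totals$fraction  <- group_totals$population / sum(group_totals$population)
group_totals$ymax      <- cumsum(group_totals$fraction)
group_totals$ymin      <- c(0, head(group_totals$ymax, -1))
group_totals$label     <- paste0(round(100 * group_totals$fraction), "%")
group_totals$label_pos <- (group_totals$ymax + group_totals$ymin) / 2

# Rectangles between x = 3 and x = 4 become a ring in polar coordinates;
# xlim(2, 4) leaves the empty hole in the middle.
donut <- ggplot(group_totals,
                aes(ymax = ymax, ymin = ymin, xmax = 4, xmin = 3, fill = group)) +
  geom_rect() +
  geom_text(aes(x = 3.5, y = label_pos, label = label), color = "white") +
  coord_polar(theta = "y") +
  xlim(c(2, 4)) +
  scale_fill_manual(values = c("#1B9E77", "#D95F02", "#7570B3")) +
  theme_void() +
  labs(title = "Population share by major occupation group", fill = "Group")
```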

Conclusion:

In this assignment, we, as a group, applied several visualization techniques to understand each type of graph in detail. It helped us understand which type of graph best suits a particular kind of analysis, which is crucial when working with data and visualization. It also helped us understand ggplot2 and other visualization libraries in depth. We now have a better understanding of data visualization using different graphs and plots for different types of datasets.